Abstract

The Expectation Maximization (EM) algorithm is a versatile tool for
model parameter estimation in latent data models. When processing large
data sets or data streams, however, EM becomes intractable since it requires
the whole data set to be available at each iteration of the algorithm. In
this contribution, a new generic online EM algorithm for model parameter
inference in general Hidden Markov Models is proposed. This new
algorithm updates the parameter estimate after a block of observations
is processed (online). The convergence of this new algorithm is established,
and the rate of convergence is studied, showing the impact of the
block size. An averaging procedure is also proposed to improve the rate of
convergence. Finally, practical illustrations are presented to highlight the
performance of these algorithms in comparison to other online maximum
likelihood procedures.

1 Introduction 

The Expectation Maximization (EM) algorithm is a well-known iterative algorithm
for maximum likelihood estimation in incomplete data models [12].
In this context, model parameter estimates are obtained by maximizing the
log-likelihood of the observations. Although the log-likelihood is not explicit
in incomplete data models, the EM algorithm is generally simple to implement
since it relies on complete data computations: each iteration consists of an
E-step, in which the expectation of the complete log-likelihood under the conditional
distribution of the latent data given the observations is computed, and an M-step,
which updates the parameter estimate based on this conditional expectation.

In many situations of interest, the complete data likelihood belongs to the
exponential family. In this case, the E-step consists in the computation of
the expectation of the complete data sufficient statistic under the conditional
distribution. The EM algorithm can then be considered equivalently
as an iterative algorithm in the space of the complete data sufficient statistics
(instead of in the parameter space).

The EM algorithm has been successfully applied for maximum likelihood 
inference in general state-space models. Except for simple models the E-step is 
intractable and has to be approximated e.g. by Monte Carlo methods such as 
Markov Chain Monte Carlo methods or Sequential Monte Carlo methods (see 
resp. [17]) depending on the complexity of the model. 

When processing large data sets or data streams, however, the EM algorithm
might become impractical. Online variants of the EM algorithm were
first proposed for independent and identically distributed (i.i.d.) observations.
The first online procedure for parameter estimation was introduced in [29] by
Titterington. This algorithm relies on a stochastic gradient approach which aims
at incorporating the newly available observation. In Cappe and Moulines [l], 
the proposed algorithm is more closely related to the original EM recursion: in 
the case of an exponential complete-data likelihood, the E-step is replaced by a 
stochastic approximation step while the M-step remains unchanged. 

More complex incomplete data models such as Hidden Markov Models (HMM) 
are of common use to represent time series in many fields such as statistics, in- 
formation engineering and financial econometrics, see |151I31| . An online version 
of the EM algorithm for inference in HMM when both the observations and the
states take a finite number of values (resp. when the states take a finite number
of values) was recently proposed by Mongillo and Deneve (resp. Cappé [3]).
In Cappé [3], the algorithm relies on the ability to compute approximations of
the filtering distribution and on an intermediate quantity based on the sufficient
statistics. In order to update these computations recursively, stochastic
approximation procedures are introduced. This algorithm has been extended to
the case of general state-space models by replacing the deterministic approximation
of the smoothing probabilities with Sequential Monte Carlo algorithms (see
Cappé [2], Del Moral et al. [8] and Le Corff et al.).

Despite the encouraging first results obtained with these online EM algorithms,
their convergence and the characterization of the
limit points (when the number of observations tends to infinity) remain an open
question. The convergence of the online variants of the EM algorithm for i.i.d.
observations is addressed by Cappé and Moulines: the limit points are the
stationary points of the Kullback-Leibler divergence between the marginal distribution
of the observation and the model distribution. No
convergence results exist for the online EM algorithms for general state-space models
(some insights on the asymptotic behavior are nevertheless given by Cappé):
the introduction of many approximations at different steps of the algorithms
makes the analysis quite challenging.

In this contribution, a new online EM algorithm is proposed for HMM with
exponential complete-data likelihood. It sticks more closely to the principles
of the original batch-mode EM algorithm. The M-step (and thus, the update
of the parameter) occurs at some deterministic times $\{T_k\}_{k \ge 1}$, i.e. we propose
to keep a fixed parameter estimate for blocks of observations of increasing size.
More precisely, let $\{T_k\}_{k \ge 0}$ be an increasing sequence of integers ($T_0 = 0$).
For each $k \ge 0$, the parameter's value is kept fixed while accumulating the
information brought by the observations $Y_{T_k+1:T_{k+1}}$. Then, the parameter is
updated at the end of the block. This algorithm is an online algorithm since the
sufficient statistics of the $k$-th block can be computed on the fly by updating
an intermediate quantity when a new observation $Y_t$, $t \in \{T_k + 1, \dots, T_{k+1}\}$, is
available. Such recursions are provided in recent works on online estimation in
HMM, see [2, 3, 8].

This new algorithm, called the Block Online EM algorithm (BOEM), is derived
in Section 2 together with an averaged version. Section 3 is devoted to practical
applications: BOEM is used to perform parameter inference in HMM where
the forward recursions mentioned above are available explicitly (this occurs e.g.
for finite state-space HMM). In the case of finite state-space HMM, BOEM is
compared to a gradient-type recursive maximum likelihood procedure and to
the online EM of [3].

The convergence of BOEM is addressed in Section 4. BOEM is seen as a
perturbation of a deterministic limiting EM algorithm, the limiting behavior
of which is studied through a Lyapunov-function technique. The perturbation
is shown to vanish (in some sense) as the number of observations increases,
thus implying that BOEM inherits the asymptotic behavior of the limiting EM
algorithm. Finally, in Section 5, we prove that the rate of convergence of BOEM
strongly depends upon the block size sequence: this rate is optimal when the
block size increases exponentially, which is, quite unfortunately, of poor practical
interest. Nevertheless, we prove that the averaged BOEM reaches this optimal
rate of convergence for slowly increasing block size sequences. All the proofs are
postponed to Section 6; supplementary materials are provided in [21].

2 The Block Online EM algorithms 
2.1 Notations and Model assumptions 

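Let $\mathbf{Y} = \{Y_t\}_{t \in \mathbb{Z}}$ be the observation process defined on $(\Omega, \mathcal{F}, \mathbb{P}_\star)$ and taking
values in $\mathsf{Y}^{\mathbb{Z}}$, where $\mathsf{Y}$ is a general space endowed with a countably generated
$\sigma$-field $\mathcal{B}(\mathsf{Y})$.

A HMM parameterized by $\theta$, for $\theta$ in a set $\Theta \subseteq \mathbb{R}^{d_\theta}$, is fitted to
the observations: consider a family of transition kernels $\{m_\theta(x, x')\,d\lambda(x')\}_{\theta \in \Theta}$
onto $\mathsf{X} \times \mathcal{B}(\mathsf{X})$, where $\mathsf{X}$ is a general state-space equipped with a countably
generated $\sigma$-field $\mathcal{B}(\mathsf{X})$, and $\lambda$ is a bounded non-negative measure on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$.
Let $\{g_\theta(x, y)\,d\nu(y)\}_{\theta \in \Theta}$ be a family of transition kernels on $\mathsf{X} \times \mathcal{B}(\mathsf{Y})$, where
$\nu$ is a measure on $(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$.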

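For any initial distribution $\chi$ on $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$, any $\theta \in \Theta$, any $r < s \le t$ and
any sequence $\mathbf{y} \in \mathsf{Y}^{\mathbb{Z}}$, define the probability measure $\Phi^{\chi,r,t}_{\theta,s}(\cdot, \mathbf{y})$ by

$$\Phi^{\chi,r,t}_{\theta,s}(h, \mathbf{y}) \overset{\mathrm{def}}{=} \frac{\int \chi(dx_r) \left\{ \prod_{i=r}^{t-1} m_\theta(x_i, x_{i+1})\, g_\theta(x_{i+1}, y_{i+1}) \right\} h(x_{s-1}, x_s, y_s)\, d\lambda(x_{r+1:t})}{\int \chi(dx_r) \left\{ \prod_{i=r}^{t-1} m_\theta(x_i, x_{i+1})\, g_\theta(x_{i+1}, y_{i+1}) \right\} d\lambda(x_{r+1:t})} \qquad (1)$$

for any bounded function $h$, where, for any $r < t$, we use the shorthand
notation $x_{r:t}$ for the sequence $(x_r, \dots, x_t)$. Note that if $\{(X_t, Y_t)\}_{t \in \mathbb{Z}}$ is a HMM
with transition kernels $m_\theta$ and $g_\theta$, $\Phi^{\chi,r,t}_{\theta,s}(h, \mathbf{Y})$ is the conditional expectation of
$h(X_{s-1}, X_s, Y_s)$ given $Y_{r+1:t}$ when $X_r \sim \chi$:

$$\Phi^{\chi,r,t}_{\theta,s}(h, \mathbf{Y}) = \mathbb{E}^{\chi}_{\theta}\left[ h(X_{s-1}, X_s, Y_s) \mid Y_{r+1:t} \right], \quad X_r \sim \chi. \qquad (2)$$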

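It is assumed that the HMM is exponential, i.e.

A1 (a) There exist continuous functions $\phi : \Theta \to \mathbb{R}$, $\psi : \Theta \to \mathbb{R}^d$ and
$S : \mathsf{X} \times \mathsf{X} \times \mathsf{Y} \to \mathbb{R}^d$ s.t.

$$\log m_\theta(x, x') + \log g_\theta(x', y) = \phi(\theta) + \langle S(x, x', y), \psi(\theta) \rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the scalar product on $\mathbb{R}^d$.

(b) There exists an open subset $\mathcal{S}$ of $\mathbb{R}^d$ that contains the convex hull of
$S(\mathsf{X} \times \mathsf{X} \times \mathsf{Y})$.

(c) There exists a continuous function $\bar\theta : \mathcal{S} \to \Theta$ s.t. for any $s \in \mathcal{S}$,

$$\bar\theta(s) = \operatorname{argmax}_{\theta \in \Theta} \left\{ \phi(\theta) + \langle s, \psi(\theta) \rangle \right\}.$$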

2.2 Block Online EM (BOEM) 

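Define

$$\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y}) \overset{\mathrm{def}}{=} \frac{1}{\tau} \sum_{t=T+1}^{T+\tau} \Phi^{\chi,T,T+\tau}_{\theta,t}(S, \mathbf{Y}). \qquad (3)$$

Once again, note that if $\{(X_t, Y_t)\}_{t \in \mathbb{Z}}$ is a HMM with transition kernels $m_\theta$
and $g_\theta$, $\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y})$ is the conditional expectation of the additive functional
$\frac{1}{\tau}\sum_{t=T+1}^{T+\tau} S(X_{t-1}, X_t, Y_t)$ given $Y_{T+1:T+\tau}$ when $X_T \sim \chi$:

$$\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y}) = \frac{1}{\tau} \sum_{t=T+1}^{T+\tau} \mathbb{E}^{\chi}_{\theta}\left[ S(X_{t-1}, X_t, Y_t) \mid Y_{T+1:T+\tau} \right], \quad X_T \sim \chi.$$

Note that (3) can be computed without any storage of the observations on
the block: the algorithm does not face any memory capacity issue; this is a
streaming procedure (see Section 2.4 below).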

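Let $\{\tau_n\}_{n \ge 1}$ be a sequence of positive integers and set

$$T_n \overset{\mathrm{def}}{=} \sum_{k=1}^{n} \tau_k \quad \text{and} \quad T_0 \overset{\mathrm{def}}{=} 0; \qquad (4)$$

$\tau_n$ denotes the length of the block $n$. To ensure the stability of this stochastic
iterative algorithm, we use a reprojection scheme. Let $\{\Theta_n\}_{n \ge 0}$
be a sequence of compact subsets of $\Theta$ s.t.

$$\forall n \ge 0, \quad \Theta_n \subset \Theta_{n+1} \quad \text{and} \quad \Theta = \bigcup_{n \ge 0} \Theta_n. \qquad (5)$$

Given an initial value $\theta_0 \in \Theta_0$ and starting with $p_0 = 0$, the BOEM algorithm
defines a sequence $\{\theta_n\}_{n \ge 1}$ by

$$\theta_{n-1/2} = \bar\theta\left( \bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta_{n-1}, \mathbf{Y}) \right); \qquad (6)$$

set $\theta_n = \theta_{n-1/2}$ and $p_n = p_{n-1}$ if $\theta_{n-1/2} \in \Theta_{p_{n-1}}$; set $\theta_n = \theta_0$ otherwise and set $p_n = p_{n-1} + 1$.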



$p_n$ counts the number of truncations; it is proved in Theorem 4.4 that $\{p_n\}_{n \ge 0}$
is finite w.p.1, i.e. w.p.1, $\theta_n = \theta_{n-1/2}$ for all $n$ large enough. BOEM updates
the parameter estimates by using the integrals (3) computed on non-overlapping
blocks of observations; the expectation is with respect to (w.r.t.) a conditional
distribution given the (random) observations $Y_{T+1:T+\tau}$. Consequently, it is a
stochastic iterative algorithm.

For ease of notation, it is assumed in this recursion that the initial distribution
$\chi$ is the same for all blocks, even though it will be clear in Section 4 that the
initial distribution can change over blocks. We will choose a positive sequence
$\{\tau_n\}_{n \ge 1}$ s.t. $\lim_{n \to \infty} \tau_n = +\infty$. Indeed, we will prove that $\lim_{\tau \to \infty} \bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y})$
exists $\mathbb{P}_\star$-a.s. (see Theorem 4.1 below), and it is thus expected that BOEM
applied with such a sequence $\{\tau_n\}_{n \ge 1}$ will have the same asymptotic behavior
as the iterative procedure in which $\bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta_{n-1}, \mathbf{Y})$ is replaced by its limit.
We will give a rigorous proof of this intuition in Section 4, as well as rigorous
assumptions on $\{\tau_n\}_{n \ge 1}$.
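The block recursion with reprojection can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the callables `block_stat` (the block statistic of (3)), `theta_bar` (the M-step map of A1(c)) and `in_compact` (membership in the current compact set) are hypothetical placeholders to be supplied by the user, and the toy usage below replaces the HMM smoothing step with a plain block average purely to exercise the control flow.

```python
def boem(theta0, block_stat, theta_bar, in_compact, block_sizes):
    """Generic BOEM loop: one parameter update per block, with reprojection.

    block_stat(theta, T, tau) -> statistic of the block Y_{T+1:T+tau} (hypothetical)
    theta_bar(s)              -> M-step map applied to the statistic (hypothetical)
    in_compact(theta, p)      -> True if theta lies in the p-th compact set
    """
    theta, p, T = theta0, 0, 0
    path = []
    for tau in block_sizes:
        s_bar = block_stat(theta, T, tau)   # E-step on the current block only
        candidate = theta_bar(s_bar)        # M-step at the end of the block
        if in_compact(candidate, p):
            theta = candidate               # accept the update
        else:
            theta, p = theta0, p + 1        # truncation: restart from theta0
        T += tau
        path.append(theta)
    return path, p


# Toy usage (assumption: the "statistic" is just the block mean of the data).
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]


def toy_block_stat(theta, T, tau):
    return sum(ys[T:T + tau]) / tau


loose_path, loose_p = boem(0.0, toy_block_stat, lambda s: s,
                           lambda th, p: abs(th) <= 10.0 * (p + 1), [3, 3])
tight_path, tight_p = boem(0.0, toy_block_stat, lambda s: s,
                           lambda th, p: abs(th) <= 2.0 * (p + 1), [3, 3])
```

With the loose compact sets no truncation occurs; with the tight ones the second update falls outside the current set, so the parameter is reset to its initial value and the truncation counter is incremented.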



2.3 Averaged Block Online EM 

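When $\tau_n$ is large, $\bar{S}^{\chi,T}_{\tau_n}(\theta, \mathbf{Y})$ may be seen as an estimator of the a.s. limit
$\lim_{\tau \to \infty} \bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y})$. By analogy to the regression problem, an estimator with
reduced variance can be obtained by averaging and weighting the successive
estimates (see [19, 26] for a discussion on the averaging procedures). Define
$\Sigma_0 \overset{\mathrm{def}}{=} 0$ and for $n \ge 1$,

$$\Sigma_n \overset{\mathrm{def}}{=} \frac{1}{T_n} \sum_{j=1}^{n} \tau_j\, \bar{S}^{\chi,T_{j-1}}_{\tau_j}(\theta_{j-1}, \mathbf{Y}); \qquad (7)$$

note that this quantity can be computed iteratively and does not require
storing the past statistics $\bar{S}^{\chi,T_{j-1}}_{\tau_j}(\theta_{j-1}, \mathbf{Y})$. Given an initial value $\tilde\theta_0$, the averaged BOEM
algorithm defines a sequence $\{\tilde\theta_n\}_{n \ge 1}$ by

$$\tilde\theta_n \overset{\mathrm{def}}{=} \bar\theta(\Sigma_n). \qquad (8)$$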
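The iterative computation of the averaged statistic can be illustrated as follows. The sketch assumes that the average in (7) weights each block statistic by its block length, i.e. $\Sigma_n = T_n^{-1}\sum_{j \le n} \tau_j \bar{S}_j$, which gives the one-step update $\Sigma_n = \Sigma_{n-1} + (\tau_n / T_n)(\bar{S}_n - \Sigma_{n-1})$; the function name and the scalar statistics are illustrative choices.

```python
def averaged_statistics(block_stats, block_sizes):
    """Running block-length-weighted average of the block statistics.

    Only the current average and the cumulative length T_n are stored,
    so past block statistics can be discarded after each update.
    """
    sigma, T = 0.0, 0
    averages = []
    for s_bar, tau in zip(block_stats, block_sizes):
        T += tau
        sigma += (tau / T) * (s_bar - sigma)   # Sigma_n from Sigma_{n-1}
        averages.append(sigma)
    return averages


# Example: three blocks of lengths 1, 2, 3 with statistics 1, 4, 2.
running = averaged_statistics([1.0, 4.0, 2.0], [1, 2, 3])
```

The last value agrees with the direct weighted average $(1 \cdot 1 + 2 \cdot 4 + 3 \cdot 2)/6 = 2.5$.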
2.4 Comments on the implementation of the algorithms

About the initial distribution $\chi$: in (3), the computation on each block is
performed with the same initial distribution $\chi$. This simplifies the presentation
of the algorithms and improves the readability of the proofs. Time dependent
initial distributions could be considered, such as using the filtering distribution
obtained at the end of block $n$ to initialize block $n + 1$. In this case, the
initial distribution depends on the past observations and the current parameter.
Different strategies are numerically compared in Section 3.

The asymptotic behavior of our algorithms is derived under so-called strong
mixing conditions of the hidden chain: this implies the forgetting of the initial
condition at a geometric rate, uniform in the initial distribution $\chi$. We prove
asymptotic results for the algorithms described above (i.e. with a fixed initial
distribution $\chi$) but our results remain true for time dependent initialization
strategies. Details are omitted.

Streaming: our algorithms update the parameter after processing a block of
observations. Nevertheless, the intermediate quantity $\bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta_{n-1}, \mathbf{Y})$ can be
either exactly computed or approximated in such a way that the observations
are processed online. In this case, once received, an observation is used to update
the intermediate quantity and is then removed from memory. The exact
computation is detailed in [3, Section 2.2] and [8, Proposition 2.1] and can be
applied e.g. to finite state-space HMM. [8] proposed a Sequential Monte Carlo
approximation to perform this update online for more complex models (the convergence
of BOEM combined with this method is addressed in [20]). Therefore,
BOEM, its averaged version and its particle approximations can be described
as streaming algorithms.
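For a finite state-space HMM, the streaming computation of the block statistic can be sketched with a standard forward-only smoothing recursion. The code below is a minimal illustration under stated assumptions, not the authors' code: states are indexed `0..d-1`, `m` is a row-stochastic transition matrix, `g(j, y)` is a positive emission likelihood, `s(i, j, y)` is a scalar statistic, and all function and variable names are hypothetical. A brute-force enumeration over all state paths is included only to check the recursion on a tiny example.

```python
from itertools import product
from math import exp


def block_statistic(chi, m, g, s, ys):
    """Forward-only value of (1/tau) * sum_t E[s(X_{t-1}, X_t, y_t) | y_{1:tau}]."""
    d = len(chi)
    phi = list(chi)     # filtering distribution; the block starts from chi
    rho = [0.0] * d     # rho[j]: expected accumulated statistic given current state j
    for y in ys:        # each observation is used once and can then be discarded
        new_phi, new_rho = [0.0] * d, [0.0] * d
        for j in range(d):
            w = [phi[i] * m[i][j] for i in range(d)]   # unnormalised backward kernel
            tot = sum(w)
            if tot > 0.0:
                new_rho[j] = sum(w[i] * (rho[i] + s(i, j, y)) for i in range(d)) / tot
            new_phi[j] = tot * g(j, y)
        norm = sum(new_phi)
        phi = [p / norm for p in new_phi]
        rho = new_rho
    return sum(p * r for p, r in zip(phi, rho)) / len(ys)


def brute_force(chi, m, g, s, ys):
    """Same quantity by enumerating every state path (exponential cost, check only)."""
    d, num, den = len(chi), 0.0, 0.0
    for path in product(range(d), repeat=len(ys) + 1):
        weight, total = chi[path[0]], 0.0
        for t, y in enumerate(ys, start=1):
            weight *= m[path[t - 1]][path[t]] * g(path[t], y)
            total += s(path[t - 1], path[t], y)
        num += weight * total
        den += weight
    return num / den / len(ys)


# Tiny two-state example with Gaussian emissions (illustrative values).
chi = [0.3, 0.7]
m = [[0.9, 0.1], [0.2, 0.8]]
means = [0.0, 2.0]
g = lambda j, y: exp(-0.5 * (y - means[j]) ** 2)
s = lambda i, j, y: y if j == 1 else 0.0
ys = [0.1, 1.9, 2.2, -0.3]
online_value = block_statistic(chi, m, g, s, ys)
exact_value = brute_force(chi, m, g, s, ys)
```

The memory footprint of `block_statistic` is independent of the block length, which is what makes the block update compatible with streaming data.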

About the block size sequence $\{\tau_n\}_{n \ge 1}$: it is expected that, when the number of observations
tends to infinity, BOEM behaves like a limiting EM, i.e. an EM procedure
with infinitely many observations. In Section 4 we characterize the asymptotic
behavior of this limiting EM algorithm and, in order to inherit this asymptotic
behavior, the number of observations per block used in BOEM has to increase
to infinity. We will see in Sections 4 and 5 that polynomially increasing sizes
$\tau_n \sim c n^a$ ($a > 1$) are enough. From a practical point of view, $\tau_n$ can be kept constant for
the first iterations so that the parameter is updated often enough in the
first part of the run. Then, $\tau_n$ increases like $c n^a$ and the user can choose $c$ in
such a way that the block sizes do not grow too fast. The influence of the block
sizes on the convergence of BOEM and its averaged version is illustrated in
Section 3 (see also Section 5 for the computation of the rates of convergence).
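A block size schedule of the kind described above (constant during a warm-up phase, then polynomially increasing) can be generated as follows. The function name, the rounding rule and the default values are illustrative assumptions, not prescriptions from the paper.

```python
import math
from itertools import accumulate


def block_sizes(n_blocks, c=1.0, a=1.1, warmup=0, tau0=1):
    """Block lengths tau_1, ..., tau_n: tau0 for the first `warmup` blocks,
    then ceil(c * n**a) for block n."""
    sizes = []
    for n in range(1, n_blocks + 1):
        if n <= warmup:
            sizes.append(tau0)
        else:
            sizes.append(max(1, math.ceil(c * n ** a)))
    return sizes


# Example: two warm-up blocks of length 5, then tau_n = 2 * n**2.
taus = block_sizes(4, c=2.0, a=2.0, warmup=2, tau0=5)
update_times = list(accumulate(taus))   # T_1, ..., T_4
```

Here `update_times` lists the observation indices at which the parameter is updated.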
3 Application to inverse problems in Hidden Markov Models

In Section 3.1, the performance of BOEM and its averaged version is illustrated
in a linear Gaussian model. In this case, the quantity $\bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta_{n-1}, \mathbf{Y})$ can be
exactly computed using a Kalman smoother, but this requires storing the data
of each block. $\bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta_{n-1}, \mathbf{Y})$ can also be approximated using a streaming
procedure (i.e. without storing any data, see [20] for Sequential Monte Carlo
methods) without modifying the limiting behavior of the algorithm. In the
experiment below, we use the Kalman smoother. The roles of the block size sequence
$\{\tau_n\}_{n \ge 1}$ and of the initialization scheme are discussed in Section 3.1.

In Section 3.2, BOEM is compared to online maximum likelihood procedures
in the case of finite state-space HMM.



3.1 Linear Gaussian Model 

Consider the Linear Gaussian model (LGM):

$$X_{t+1} = \phi X_t + \sigma_u U_t, \qquad Y_t = X_t + \sigma_v V_t,$$

where $X_0 \sim \mathcal{N}(0, \sigma_u^2 (1 - \phi^2)^{-1})$, and $\{U_t\}_{t \ge 0}$, $\{V_t\}_{t \ge 0}$ are independent i.i.d. standard
Gaussian r.v., independent from $X_0$. Data are sampled using $\phi = 0.9$,
$\sigma_u^2 = 0.6$ and $\sigma_v^2 = 1$. All runs are started with $\phi = 0.1$, $\sigma_u^2 = 1$ and $\sigma_v^2 = 2$.
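Data from this model can be simulated with the standard library alone. The sketch below assumes the parameter values read from the text ($\phi = 0.9$, state noise variance 0.6, observation noise variance 1) and a stationary initial state; the function name and the seeding scheme are illustrative.

```python
import math
import random


def simulate_lgm(n, phi=0.9, sigma_u2=0.6, sigma_v2=1.0, seed=0):
    """Simulate n steps of X_{t+1} = phi X_t + sigma_u U_t, Y_t = X_t + sigma_v V_t.

    Requires |phi| < 1 so that the stationary variance sigma_u^2 / (1 - phi^2) exists.
    """
    rng = random.Random(seed)
    su, sv = math.sqrt(sigma_u2), math.sqrt(sigma_v2)
    x = rng.gauss(0.0, su / math.sqrt(1.0 - phi ** 2))   # stationary initial state
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(x + sv * rng.gauss(0.0, 1.0))          # noisy observation of X_t
        x = phi * x + su * rng.gauss(0.0, 1.0)           # move to X_{t+1}
    return xs, ys


states, observations = simulate_lgm(200, seed=1)
```

Setting `sigma_v2=0.0` returns observations equal to the hidden states, which gives a quick sanity check of the simulator.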

We illustrate the convergence of the BOEM algorithms. We choose $\tau_n = a(n + 1)$.
We display in Figure 1 the box and whisker plots for the estimation
of $\phi$ obtained with 100 independent Monte Carlo experiments; different values
of $a$ are also considered. Both BOEM and its averaged version converge to the
true value $\phi = 0.9$; the averaging procedure clearly improves the variance of
the estimation. Figure 1 shows that the averaged procedure needs a few more
iterations to converge but, when compared to the non averaged one, the variance
is much smaller.
(a) BOEM without averaging, $\tau_n = a(n + 1)$. (b) BOEM with averaging, $\tau_n = a(n + 1)$.

Figure 1: Estimation of $\phi$ for $a = 10$ (left), $a = 100$ (middle) and $a = 300$
(right) after 50, 100, 150, 200 and 250 blocks.

We now discuss the role of the initial distribution $\chi$. The convergence results
(see Section 4) show that our algorithms converge whatever $\chi$. Figure 2 displays
the estimation of $\phi$ by the averaged BOEM algorithm with $\tau_n \sim (n + 99)$
over 100 independent Monte Carlo runs as a function of the number of blocks.
We consider first the case when $\chi$ is the stationary distribution of the hidden
process, i.e. $\chi = \mathcal{N}(0, (1 - \phi^2)^{-1} \sigma_u^2)$, computed with the current estimates, and
the case when $\chi$ is the filtering distribution obtained at the end of the previous
block, computed with the Kalman filter. In terms of estimation error,
the two strategies are similar. We observe the same phenomenon for different
values of $\phi$ (plots are reported in [21, Section 5]). Therefore, it is advocated to
choose $\chi$ as the filtering distribution obtained at the end of the previous block.






Figure 2: Estimation of $\phi$ after 5, 10, 25, 50, 100 and 150 blocks, with two different
initialization schemes: the stationary distribution (left) and the filtering
distribution at the end of the previous block (right). The boxplots are computed
with 100 Monte Carlo runs.



We now discuss the role of $\{\tau_n\}_{n \ge 0}$. Figure 3 displays the empirical variance,
when estimating $\phi$, computed with 100 independent Monte Carlo runs, for
different numbers of observations and for both BOEM and its averaged version.
We consider four polynomial rates $\tau_n \sim n^b$, $b \in \{1.2, 1.8, 2, 2.5\}$. Figure 3a
shows that the choice of $\{\tau_n\}_{n \ge 0}$ has a great impact on the empirical variance
of the (non averaged) BOEM path $\{\theta_n\}_{n \ge 0}$. To reduce this variability, a solution
could consist in increasing the block sizes $\tau_n$ at a larger rate, although this
implies practical difficulties: when $\tau_n \sim n^b$ with $b$ large, many observations are needed for
each update of the parameter sequence. The influence of the block size sequence
$\tau_n$ is greatly reduced with the averaging procedure, as shown in Figure 3b. We
will show in Section 5 that averaging really improves the rate of convergence of
BOEM.

In addition, it is not advisable to start the averaging procedure with too
few observations, as illustrated by Figure 4. The first estimates highly depend
on the initialization of the parameter and the averaging procedure should start
after a burn-in period.

As a conclusion, it is advocated to use the averaged BOEM algorithm. In
practice, one could use slowly increasing sequences $\tau_n$ for the first iterations,
and then use more rapidly increasing sequences after the burn-in period.

Figure 3: BOEM: empirical variance of the estimation of $\phi$ after n = 0.5£10^

Figure 4: BOEM with averaging: empirical variance of the estimation of $\phi$
after n = 1000, 1500, 2500 and 3000 observations for different block size schemes

3.2 Finite state-space HMM

We consider models where the unobservable states take a finite number of values.
Mixture processes with Markov dependence, switching processes with Markov
regime, communication channels driven by Hidden Markov processes, and composite
sources with switch controlled by a Markov chain are examples of finite
state-space HMM found useful in many fields including biostatistics, genomics,
information theory and speech processing (see e.g. [16] for a review). In the numerical
applications below, we consider a Gaussian mixture process with Markov
dependence of the form $Y_t = X_t + V_t$, where $\{X_t\}_{t \ge 0}$ is a Markov chain taking
values in $\{x_1, \dots, x_d\}$, with initial distribution $\nu$ and a $d \times d$ transition matrix
$m$. $\{V_t\}_{t \ge 0}$ are i.i.d. $\mathcal{N}(0, v)$ r.v., independent from $\{X_t\}_{t \ge 0}$. Observations are
sampled using $d = 6$, $v = 0.5$, $x_i = i$, $\forall i \in \{1, \dots, d\}$, and the true transition
matrix is given in [21, Section 5.2].
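The observation model can be simulated as follows. The true transition matrix is only given in the supplementary material, so the sketch uses a uniform matrix and a uniform initial distribution as clearly hypothetical stand-ins; only $d = 6$, $v = 0.5$ and $x_i = i$ are taken from the text.

```python
import math
import random


def simulate_markov_mixture(n, m, states, v, seed=0):
    """Simulate Y_t = X_t + V_t with X a finite Markov chain and V_t ~ N(0, v)."""
    rng = random.Random(seed)
    d = len(states)
    k = rng.randrange(d)                 # assumption: uniform initial distribution
    xs, ys = [], []
    for _ in range(n):
        xs.append(states[k])
        ys.append(states[k] + math.sqrt(v) * rng.gauss(0.0, 1.0))
        k = rng.choices(range(d), weights=m[k])[0]   # next state index from row k
    return xs, ys


d = 6
m_uniform = [[1.0 / d] * d for _ in range(d)]   # hypothetical transition matrix
hidden, observed = simulate_markov_mixture(1000, m_uniform, list(range(1, d + 1)), 0.5)
```

Any row-stochastic matrix can be passed in place of `m_uniform` once the actual transition matrix is available.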



3.2.1 Comparison to an online EM based procedure 

In this case, we want to estimate the variance $v$ and the states $\{x_1, \dots, x_d\}$. All
the runs are started from $v = 2$ and from the initial states $\{-1; 0; .5; 2; 3; 4\}$. We
compare the averaged BOEM to the online EM procedure of [3] combined with
a Polyak-Ruppert averaging (see [25]). The algorithm in [3] follows a stochastic
approximation update and depends on a step-size sequence {7n}„>o. It is ex- 
pected that the rate of convergence in L2 after n observations is 7„ (and l/\/n 
for its averaged version) - this assertion relies on classical results for stochastic 
approximation. We prove in Section |5] that the rate of convergence of BOEM 
is (and l/^/n for its averaged version) when r„ cx n'' . Therefore, 

we set T„ = n^'^ and 7„ = n~°'^^. Figure [s] displays the empirical median and 
first and last quartiles for the estimation of $v$ with both algorithms and their
averaged versions as a function of the number of observations. These estimates
are obtained over 100 independent Monte Carlo runs. Both BOEM and OEM
converge to the true value of $v$ and the averaged versions reduce the variability
of the estimation. Figure 6 shows the similar behavior of both averaged algorithms
for the estimation of $x_1$ in the same experiment. Nevertheless, while
the online EM of [3] has an encouraging experimental behavior, there is still no
theoretical proof of its convergence. Some supplementary graphs on the estimation
of the states can be found in [21, Section 5].



3.2.2 Comparison to a recursive maximum likelihood procedure 

We want to estimate the variance $v$ and the transition matrix $m$. All the
runs are started from $v = 2$ and from a matrix $m$ with each entry equal to
$1/d$. The averaged BOEM is compared to a recursive maximum likelihood
(RML) procedure (see |23l |5S]) combined with Polyak-Ruppert averaging (see 
|26j). RML follows a stochastic approximation update and depends on a step- 



size sequence {7n}n>o which is chosen in the same way as in Section 3.2.1 
Therefore, for a fair comparison, RML (resp. BOEM) is run with 7„ = n"*^-^' 
(resp. T„ = n^'^). Figure [t] displays the empirical median and empirical first 
and last quartiles of the estimation of $m(1, 1)$ as a function of the number of
observations over 100 independent Monte Carlo runs. For both algorithms, the
bias and the variance of the estimation decrease as $n$ increases. Nevertheless,
the bias and/or the variance of the averaged BOEM decrease faster than those of
the averaged RML (similar graphs have been obtained for the estimation of the
other entries of the matrix $m$ and for the estimation of $v$; see [21, Section 5]). As
a conclusion, it is advocated to use the averaged BOEM instead of the averaged
RML.

(a) Estimation of $v$ with BOEM. (b) Estimation of $v$ with OEM.
(c) Estimation of $v$ with averaged BOEM. (d) Estimation of $v$ with averaged OEM.

Figure 5: Estimation of $v$ using the online EM and BOEM (top) and their
averaged versions (bottom). Each plot displays the empirical median (bold
line) and the first and last quartiles (dotted lines) over 100 independent Monte
Carlo runs, as a function of the number of observations.

(a) Estimation of $x_1$ with averaged BOEM. (b) Estimation of $x_1$ with averaged OEM.

Figure 6: Estimation of $x_1$ using the averaged OEM and averaged BOEM. Each
plot displays the empirical median (bold line) and the first and last quartiles
(dotted lines) over 100 independent Monte Carlo runs, as a function of the number
of observations. The first ten observations are omitted for better visibility.

(a) Averaged BOEM. (b) Averaged RML.

Figure 7: Empirical median (bold line) and first and last quartiles (dotted line)
for the estimation of $m(1, 1)$ using the averaged RML algorithm (right) and
the averaged BOEM algorithm (left), as a function of the number of observations.
The true value is $m(1, 1) = 0.5$ and
the averaging procedure is started after 10000 observations. The first 10000
observations are not displayed for better clarity.



4 Convergence of the Block Online EM algorithms 

It is shown in Section 4.2 that for any $T \ge 0$ and any initial distribution $\chi$,
the quantity $\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y})$ converges $\mathbb{P}_\star$-a.s., when $\tau \to +\infty$, to a deterministic
quantity $\bar{S}(\theta)$ that does not depend on $T$ and $\chi$. Therefore, the BOEM algorithm
can be seen as a perturbation of the so-called limiting EM algorithm, defined as
a deterministic iterative algorithm $\theta_n = R(\theta_{n-1})$ where

$$R(\theta) \overset{\mathrm{def}}{=} \bar\theta\left( \bar{S}(\theta) \right). \qquad (9)$$

The limiting points of the limiting EM algorithm are identified (see Section 4.3)
and it is shown in Section 4.4 that BOEM inherits this limiting behavior provided
the perturbation can be made small enough. All the convergence results are
addressed under the assumptions introduced in Section 4.1.

4.1 Assumptions 

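A2 There exist $\sigma_-$ and $\sigma_+$ s.t. for any $(x, x') \in \mathsf{X}^2$ and any $\theta \in \Theta$, $0 < \sigma_- \le
m_\theta(x, x') \le \sigma_+$. Set $\rho \overset{\mathrm{def}}{=} 1 - (\sigma_- / \sigma_+)$.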

This assumption is known in the literature as the strong mixing condition. It 
is commonly used to prove the forgetting property of the initial condition of 
the filter, see e.g. [HI This assumption holds for example if X is finite and 
for any $(x, x') \in \mathsf{X}^2$, $0 < \inf_{\theta \in \Theta} m_\theta(x, x') \le \sup_{\theta \in \Theta} m_\theta(x, x') < +\infty$. Under regularity
conditions on the kernels $\{m_\theta, \theta \in \Theta\}$, it also holds when $\mathsf{X}$ is compact.
Nevertheless, it fails to hold in standard situations such as linear and Gaussian
state-space models. It has been weakened in recent works: in [13], the exponential
forgetting of the initial condition of the filter is proved with a local Doeblin
property; [30] gives a uniform time average convergence of some particle filters.
The approach in [13] could be adapted to the present paper but at a quite
technical cost. For pedagogical purposes, we will assume A2 throughout this
paper.

We now introduce assumptions on the observation process. Define the shift
operator $\vartheta$ onto $\mathsf{Y}^{\mathbb{Z}}$ by $(\vartheta \mathbf{y})_k = y_{k+1}$ for any $k \in \mathbb{Z}$; and by induction, define
the $s$-iterated shift operator $\vartheta^{s+1} \mathbf{y} = \vartheta(\vartheta^s \mathbf{y})$, with the convention that $\vartheta^0$ is the
identity operator. The shift operator is said to be ergodic for $\mathbb{P}_\star$ if for each set
A in {AG 6(Y)®2; A = ^^-^(A)}, P^^) e {0, 1} (see P p.314]). 

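A3-(p) $\mathbb{E}_\star\left[ \sup_{(x, x') \in \mathsf{X}^2} |S(x, x', Y_0)|^p \right] < +\infty$.

A4 (a) Under $\mathbb{P}_\star$, $\mathbf{Y}$ is a stationary sequence.

(b) The shift operator $\vartheta$ is ergodic with respect to $\mathbb{P}_\star$.

(c) $\mathbb{E}_\star\left[ |\log b_-(Y_0)| + |\log b_+(Y_0)| \right] < +\infty$ where

$$b_-(y) \overset{\mathrm{def}}{=} \inf_{\theta \in \Theta} \int g_\theta(x, y)\, \lambda(dx), \qquad b_+(y) \overset{\mathrm{def}}{=} \sup_{\theta \in \Theta} \int g_\theta(x, y)\, \lambda(dx).$$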

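Finally, assumptions on the forgetting properties of the observations $\mathbf{Y}$ are
required. For any sequence of r.v. $Z \overset{\mathrm{def}}{=} \{Z_t\}_{t \in \mathbb{Z}}$ on $(\Omega, \mathbb{P}, \mathcal{F})$, let

$$\mathcal{F}^Z_k \overset{\mathrm{def}}{=} \sigma\left( \{Z_u\}_{u \le k} \right) \quad \text{and} \quad \mathcal{G}^Z_k \overset{\mathrm{def}}{=} \sigma\left( \{Z_u\}_{u \ge k} \right) \qquad (10)$$

be $\sigma$-fields associated to $Z$. We also define the mixing coefficients by, see [7],

$$\beta^Z(n) = \sup_{u \in \mathbb{Z}} \sup_{B \in \mathcal{G}^Z_{u+n}} \left| \mathbb{P}(B \mid \mathcal{F}^Z_u) - \mathbb{P}(B) \right|, \quad \forall n \ge 0. \qquad (11)$$

A5 There exist $C \in [0, 1)$ and $\beta \in (0, 1)$ s.t. for any $n \ge 0$, $\beta^{\mathbf{Y}}(n) \le C \beta^n$,
where $\beta^{\mathbf{Y}}$ is defined in (11).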


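Under A4(a), the shift operator $\vartheta$ preserves the measure $\mathbb{P}_\star$ on $(\mathsf{Y}^{\mathbb{Z}}, \mathcal{B}(\mathsf{Y})^{\otimes \mathbb{Z}})$.
A5 is used to control the $L_p$-mean error between the limiting EM map $\bar\theta(\bar{S}(\theta))$
and a BOEM iteration $\bar\theta\left( \bar{S}^{\chi,T_{n-1}}_{\tau_n}(\theta, \mathbf{Y}) \right)$, both started from the same point $\theta$.
Examples of observation processes satisfying A4(b) and A5 include geometrically
ergodic Markov chains, as discussed in [21, Section 2.1].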

We conclude this set of assumptions by a condition on the block size se- 
quence. 

A6-(p) The block size sequence $\{\tau_n\}_{n \ge 1}$ satisfies $\sum_{k \ge 1} \tau_k^{-p/2} < \infty$.

4.2 Block Online EM and limiting EM algorithms 



Theorem 4.1. Assume ^ and J^^^- Let S : x Y ^ he a measur- 
able function s.t. J^(l) holds. For any 9 G Q, there exists a V-i,-integrable 
r.v. denoted by $\mathbb{E}_\theta[S(X_{-1}, X_0, Y_0) \mid \mathbf{Y}]$ s.t. for any probability distribution $\chi$ on
$(\mathsf{X}, \mathcal{B}(\mathsf{X}))$,



sup 



^e:o,r (S, Y) - Eg [S{X^i,Xo, Yo)\Y] 



<2 



sup \S{x,x',Yo) 



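a.s. (12)

Define for all $\theta \in \Theta$,

$$\bar{S}(\theta) \overset{\mathrm{def}}{=} \mathbb{E}_\star\left[ \mathbb{E}_\theta\left[ S(X_{-1}, X_0, Y_0) \mid \mathbf{Y} \right] \right]. \qquad (13)$$

$\theta \mapsto \bar{S}(\theta)$ is continuous on $\Theta$ and for any $T \ge 0$,

$$\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y}) \longrightarrow \bar{S}(\theta) \quad \mathbb{P}_\star\text{-a.s. as } \tau \to +\infty. \qquad (14)$$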



The proof of Theorem 4.1 is given in Section 6.1. Eqs. (2) and (12) show that
when $\{(X_t, Y_t)\}_{t \in \mathbb{Z}}$ is a HMM with transition kernels $m_\theta$ and $g_\theta$, the limiting
statistic $\mathbb{E}_\theta[S(X_{-1}, X_0, Y_0) \mid \mathbf{Y}]$ is the a.s. limit of the conditional expectation of
$S(X_{-1}, X_0, Y_0)$ given $Y_{-r+1:r}$ when $X_{-r} \sim \chi$, whatever $\chi$ is:

$$\mathbb{E}^{\chi}_{\theta}\left[ S(X_{-1}, X_0, Y_0) \mid Y_{-r+1:r} \right] \longrightarrow \mathbb{E}_\theta\left[ S(X_{-1}, X_0, Y_0) \mid \mathbf{Y} \right] \quad \mathbb{P}_\star\text{-a.s. as } r \to +\infty.$$

$\bar{S}$ is the $\mathbb{P}_\star$-a.s. limit of the usual sufficient statistics of the EM algorithm
when the number of observations grows to infinity. Hence, the limiting EM can
be seen as an EM algorithm with the whole data set $\mathbf{Y}$: since $\mathbf{Y}$ is stationary,
for this limiting EM, the so-called sufficient statistics (in exponential HMM)
depend on the observations only through the mean $\mathbb{E}_\star\left[ \mathbb{E}_\theta[S(X_{-1}, X_0, Y_0) \mid \mathbf{Y}] \right]$.

As a consequence of (14), when $\tau$ is large, the quantity $\bar{S}^{\chi,T}_{\tau}(\theta, \mathbf{Y})$ is an
approximation of $\bar{S}(\theta)$. Therefore, the BOEM algorithm (6) is a perturbation
of the limiting EM algorithm (9). Based on this remark, we first address the
convergence of the limiting EM and then we show that BOEM has the same
behavior.



4.3 Asymptotic behavior of the limiting EM 

The convergence of the limiting EM is addressed following the same approach
as in [23] for the convergence of the EM algorithm. It relies on a Lyapunov
function $W$ w.r.t. the map $R$ and the set

$$\mathcal{L} \overset{\mathrm{def}}{=} \{\theta \in \Theta;\ R(\theta) = \theta\}. \qquad (15)$$

The existence of such a Lyapunov function is the key ingredient to identify the
limiting points of the algorithm (9).

Proposition 4.2. Assume J^Tj^ and Then R given by ^ is con- 

tinuous on and there exists a continuous function on O, W, s.t. 



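(i) For all $\theta \in \Theta$, $W \circ R(\theta) - W(\theta) \ge 0$.

(ii) For any compact set $\mathcal{K} \subseteq \Theta \setminus \mathcal{L}$, $\inf_{\theta \in \mathcal{K}} \{W \circ R(\theta) - W(\theta)\} > 0$.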



Proposition 4.2 is proved in Section 6.2. The following proposition gives
a set of sufficient conditions for the convergence of the limiting EM algorithm
$\theta_n = R(\theta_{n-1})$ to the set $\mathcal{L}$ (see [17, Proposition 9] for the proof).



Proposition 4.3. Assume y4[7]{l[ jQ(l) and ^ A ssume in addition that for 

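any $M > 0$, the set $\mathcal{K}_M \overset{\mathrm{def}}{=} \{\theta \in \Theta;\ W(\theta) \ge M\}$ is a compact subset of $\Theta$.
Then, for any initial value $\theta_0$ s.t. $W(\mathcal{K}_{W(\theta_0)} \cap \mathcal{L})$ has an empty interior, there
exists $w_\star$ s.t. $\{\theta_k\}_{k \ge 0}$ converges to $\{\theta \in \mathcal{L};\ W(\theta) = w_\star\}$.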

It is well known that for EM, a natural Lyapunov function is based on the 
(normalized) log-likelihood of the observations (see e.g. [551). [T^, Lemma 2 and 
Proposition 1] shows that, under A2|3 the normalized log- likelihood converges 



and this limit, hereafter denoted by $\ell(\theta)$, is deterministic and does not depend
on the initial distribution $\chi$ of the hidden chain. To prove Proposition 4.2, we
establish that the function $W : \theta \mapsto \exp(\ell(\theta))$ is a Lyapunov function for the
map $R$ and the set $\mathcal{L}$. It can be proved that, under regularity conditions on
the HMM, the set $\mathcal{L}$ is the set of the stationary points of $\ell$; this discussion
is detailed in [21, Theorem 14]. By Sard's theorem, if $W$ is at least $d_\theta$ (where
$\Theta \subseteq \mathbb{R}^{d_\theta}$) times continuously differentiable, then $W(\mathcal{L})$ has Lebesgue measure 0 and
hence has an empty interior.

The assumptions on the compactness of the level sets $\mathcal{K}_M$ highly depend on
the model. In [35], the same assumption is used to prove that the limit points of
the EM algorithm are the stationary points of the likelihood of the observations.
In [11, 17], the stability of stochastic-type EM algorithms relies on these assumptions.
Since $W$ is continuous, the compactness of the level sets can be proved if
$\lim W(\theta) = 0$ as $\theta \to \partial\Theta$.

4.4 Asymptotic behavior of the Block Online EM algorithms



Theorem 4.4 establishes the convergence of BOEM. Let $\mathrm{Cl}(A)$ be the closure of
the set $A$.

Theorem 4.4. Assume A1-2, A3-($p_2$), A4-5 and A6-($p_1$) for some $2 < p_1 < p_2$.
Assume in addition that $W(\mathcal{L})$ is compact and, for any $M > 0$, the level set
$\{\theta \in \Theta;\ W(\theta) \ge M\}$ is compact. Then,

(a) $\limsup_n p_n < +\infty$ $\mathbb{P}_\star$-a.s., where $p_n$ is defined in (6).

(b) If $W(\mathcal{L} \cap \mathrm{Cl}(\{\theta_n\}_{n \ge 0}))$ has an empty interior, there exists $w_\star$ s.t. $\{\theta_n\}_{n \ge 0}$
converges to $\{\theta \in \mathcal{L};\ W(\theta) = w_\star\}$.



Theorem 4.4 implies that the number of truncations $p_n$ in (6) is almost
surely finite, so that for a (random) sufficiently large $n$, $\theta_n = \theta_{n-1/2}$. It shows
that the BOEM algorithm and the limiting EM have the same asymptotic behavior.
The convergence of $\{\theta_n\}_{n \ge 0}$ is established under the same assumptions
on $W(\mathcal{L} \cap \mathrm{Cl}(\{\theta_n\}_{n \ge 0}))$ as in Proposition 4.3 for the convergence of the limiting EM sequence (see
above for comments on this assumption). The proof is detailed in Section 6.3;
it consists in applying the results of [17] on the convergence of a sequence generated
by iterated random maps, which are perturbations of a point-to-point
map associated to a Lyapunov function. The key ingredient is to prove that
the perturbation vanishes when the number of iterations tends to infinity; in
our case, this is done through the control of the $L_p$-mean error when replacing
the limiting quantity $\bar S(\theta_{n-1})$ by $\bar S^{\chi_{n-1},T_{n-1}}_{\tau_n}(\theta_{n-1},\mathbf{Y})$ (see Proposition 6.5 in
Section 6.3).

Then, we show that $\{W(\theta_n)\}_{n \ge 0}$ converges to a connected component of
$W(\mathcal{L})$ and, when $W(\mathcal{L} \cap \mathrm{Cl}(\{\theta_n\}_{n \ge 0}))$ has an empty interior, $\{W(\theta_n)\}_{n \ge 0}$ converges
to a point $w_\star$. We then deduce the convergence of $\{\theta_n\}_{n \ge 0}$.

A convergence result for the averaged BOEM algorithm can be obtained
following the same lines as in the proof of Theorem 4.4. The main ingredient for
this proof is the control of the $L_p$-mean error when replacing $\bar S(\theta_{n-1})$ by $\Sigma_n$ (see
Lemma 6.7 below). It can be proved that, along any converging BOEM path,
the averaged BOEM algorithm and the limiting EM have the same asymptotic
behavior. Details are omitted for brevity.
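To make the averaging step concrete, here is a minimal sketch of a running average of block statistics weighted by block sizes. It assumes, for illustration only, that the averaged statistic is $\Sigma_n = T_n^{-1}\sum_{k \le n} \tau_k S_k$ with $T_n = \sum_{k \le n}\tau_k$; the exact indexing used by the averaged BOEM algorithm is fixed in the part of the paper that defines it, and the function name `averaged_statistics` is hypothetical.

```python
def averaged_statistics(block_stats, block_sizes):
    """Running weighted averages of block statistics, weights = block sizes.

    Assumption for illustration: Sigma_n = (1 / T_n) * sum_{k<=n} tau_k * S_k.
    """
    sigma, total = 0.0, 0
    averages = []
    for stat, tau in zip(block_stats, block_sizes):
        total += tau
        # Recursive form: Sigma_n = Sigma_{n-1} + (tau_n / T_n) * (S_n - Sigma_{n-1}).
        sigma += (tau / total) * (stat - sigma)
        averages.append(sigma)
    return averages


averages = averaged_statistics([1.0, 2.0, 4.0], [1, 2, 5])
```

The recursive form needs only the previous average and the running total $T_n$, so no past block has to be stored.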



5 Rate of convergence of the Block Online EM 
algorithm 

We address the rate of convergence of $\{\theta_n\}_{n \ge 0}$ and $\{\tilde\theta_n\}_{n \ge 0}$, resp. given by (6)
and ([s]), to a point $\theta_\star \in \mathcal{L}$ (see Theorem 4.4). It is assumed that



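A7 (a) $\bar S$ and $\bar\theta$ are twice continuously differentiable on $\Theta$ and $\mathcal{S}$.

(b) There exists $0 < \gamma < 1$ s.t. $\mathrm{sp}\left(\nabla_s(\bar S \circ \bar\theta)_{s = \bar S(\theta_\star)}\right) \le \gamma$, where sp
denotes the spectral norm.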

A8 (a) {t„+i/t„}„>o converges to q and 717 < 1. 

(b) limsup„ ELi{ ^ - 9 ^ + logTfcl/Vi; < 00. 



Under A6, $\lim_n \tau_n = +\infty$. A8 strengthens A6: A8(a) is satisfied for geometric
rates of the form $\tau_n \sim a r^n$ with $r \in (1, \gamma^{-1})$, for polynomial rates $\tau_n \sim c n^b$ with
$b > 0$, for sub-exponential rates $\log \tau_n \sim c n^b$ with $c > 0$, $b \in (0, 1)$, and more
generally for sub-geometric rates. A8(b) is satisfied for geometric rates of the
form $\tau_n \sim a r^n$ with $r > 1$, for polynomial rates of the form $\tau_n \sim c n^b$ with $b > 1$,
and for any sub-exponential rate.
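The three families of block-size schedules mentioned above can be compared numerically. The following sketch is illustrative only: the function names and default constants are hypothetical, and block sizes are rounded down to integers. It evaluates the ratio $\tau_{n+1}/\tau_n$, which tends to 1 for polynomial and sub-exponential schedules and to $r$ for geometric ones.

```python
import math


def polynomial_blocks(n, c=1.0, b=1.5):
    """Polynomial schedule tau_n ~ c * n**b."""
    return max(1, math.floor(c * n ** b))


def geometric_blocks(n, a=1.0, r=1.2):
    """Geometric schedule tau_n ~ a * r**n."""
    return max(1, math.floor(a * r ** n))


def subexponential_blocks(n, c=1.0, b=0.5):
    """Sub-exponential schedule log(tau_n) ~ c * n**b with 0 < b < 1."""
    return max(1, math.floor(math.exp(c * n ** b)))


def ratio(schedule, n):
    """Ratio tau_{n+1} / tau_n of consecutive block sizes."""
    return schedule(n + 1) / schedule(n)


poly_ratio = ratio(polynomial_blocks, 10_000)
geo_ratio = ratio(geometric_blocks, 100)
subexp_ratio = ratio(subexponential_blocks, 10_000)
```

For the geometric schedule the limit of the ratio is $r$, which is why the text restricts $r$ to $(1, \gamma^{-1})$.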

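Hereafter, for any sequence of random variables $\{Z_n\}_{n \ge 0}$, write $Z_n = O_{L_p}(1)$
if $\limsup_n \mathbb{E}_\star[|Z_n|^p] < \infty$; $Z_n = O_{a.s}(1)$ if $\sup_n |Z_n| < +\infty$ $\mathbb{P}_\star$-a.s.; and $Z_n = o_{a.s}(1)$
if $\lim_{n \to +\infty} |Z_n| = 0$ $\mathbb{P}_\star$-a.s.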
Theorem 5.1. Assume A1-2, A3-($p_2$), A4-5, A6-($p_1$), A7 and A8(a) for some
$2 < p_1 < p_2$. Then, for any $p \in (2, p_2)$,



U iii, 



Ol, (1) + -=Ol,/. (l)a.s (1) + Oa.s. (1) . (16) 



If in addition A8(b) holds, then for any $p \in (2, p_2)$,



OL,(l) + ^OL,,,(l)Oa.s(l) 



(17) 



The proof of Theorem 5.1 is given in Section 6.4. Eq. (16) shows that the
error $\theta_n - \theta_\star$ is decomposed into two terms and the $L_p$-norm of the leading term
is inversely proportional to $\tau_n^{1/2}$. Hence, the rate of BOEM is closely related to
the choice of the number of observations per block. The first column of Table 1
gives explicit rates of convergence for different block sizes.



In (16), the rate is a function of the number of updates (i.e. the number of
iterations of the algorithm). This rate could also be interpreted as a function of
the total number of observations up to iteration $n$. To that goal, let $\phi(n) + 1$ be
the index of the block the $n$-th observation belongs to, i.e. $\phi(n)$ is the largest
integer s.t.

$\sum_{k=0}^{\phi(n)} \tau_k < n \le \sum_{k=0}^{\phi(n)+1} \tau_k$ , (by convention, $\tau_0 = 0$).

The interpolated sequence deduced from $\{\theta_n\}_{n \ge 0}$ is thus defined by taking the
value $\theta_{\phi(n)}$ at the $n$-th observation (the value of the interpolated sequence is kept fixed within each
block). The second column of Table 1 gives the rate of convergence of this
interpolated sequence (deduced from the square root of $\tau_{\phi(n)}$) up to a multiplicative
constant. This rate of convergence is slower than $n^{1/2}$, except in the
geometric case. Note however that the geometric case is of weak practical interest,
since the parameter is hardly ever updated, thus yielding algorithms
which are very sensitive to the initial value $\theta_0$ (see Section [s]).
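The block index $\phi(n)$ and the interpolated sequence can be computed with a cumulative sum and a binary search. The sketch below is an illustration under the convention $\tau_0 = 0$ used above; `block_sizes` holds $\tau_1, \tau_2, \dots$ (all positive), `estimates[k]` stands for $\theta_k$, and both function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def block_index(n, block_sizes):
    """Return phi(n): observation n (n >= 1) belongs to block phi(n) + 1."""
    cumulative = [0] + list(accumulate(block_sizes))  # T_0 = 0, T_1, T_2, ...
    if not 1 <= n <= cumulative[-1]:
        raise ValueError("n must lie within the observations covered by the blocks")
    # phi(n) is the largest index with T_phi < n, i.e. T_phi < n <= T_{phi+1}.
    return bisect_left(cumulative, n) - 1


def interpolate(estimates, block_sizes):
    """Value of the interpolated sequence at observations 1, ..., T_last."""
    total = sum(block_sizes)
    return [estimates[block_index(n, block_sizes)] for n in range(1, total + 1)]


sizes = [3, 2, 4]
interpolated = interpolate(["theta0", "theta1", "theta2"], sizes)
```

With these block sizes, observations 1 to 3 see `theta0`, observations 4 and 5 see `theta1`, and observations 6 to 9 see `theta2`.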



Eq. (17) addresses the rate of convergence of the averaged BOEM algorithm.
It shows that when the condition A8 is strengthened in such a way that
$\lim_n n/\sqrt{T_n} = 0$, averaging reduces the influence of the block-size schedule: the
error $\tilde\theta_n - \theta_\star$ has a rate of convergence proportional to $T_n^{-1/2}$, i.e. to the inverse
of the square root of the total number of observations up to iteration $n$. The
last column of Table 1 shows that this averaging procedure gives an optimal
rate of convergence, whatever the block-size sequence.

As a conclusion, the averaged BOEM algorithm reaches the optimal rate of
convergence even when the block-size sequence $\{\tau_n\}_{n \ge 0}$ slowly increases, thus
allowing polynomially increasing block sizes.
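Block size $\tau_n$ | BOEM, $n$ updates | BOEM, $n$ observations | Averaged BOEM, $n$ updates | Averaged BOEM, $n$ observations
$n^{b}$ | $n^{b/2}$ | $n^{b/(2(b+1))}$ | $n^{(b+1)/2}$ | $n^{1/2}$
$\exp(c n^{b})$, $b \in (0,1)$ | $\exp(0.5\, c n^{b})$ | $n^{1/2} (\ln n)^{(b-1)/(2b)}$ | $n^{(1-b)/2} \exp(0.5\, c n^{b})$ | $n^{1/2}$
$c\, r^{n}$, $r \in (1, \gamma^{-1})$ | $r^{n/2}$ | $n^{1/2}$ | $r^{n/2}$ | $n^{1/2}$

Table 1: Rate of convergence of both algorithms (up to a multiplicative constant)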
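The polynomial entries $n^{b/2}$ and $n^{b/(2(b+1))}$ of Table 1 can be cross-checked numerically: with $\tau_k = k^b$, the total number of observations after $n$ blocks is $T_n \sim n^{b+1}/(b+1)$, so $\tau_n^{1/2}$ should be proportional to $T_n^{b/(2(b+1))}$. The sketch below is a sanity check under these assumptions (unit constant, integer $b$); the function name is hypothetical.

```python
def polynomial_row_check(n, b=2):
    """Ratio of the per-update rate to the per-observation rate for tau_k = k**b."""
    tau_n = n ** b
    total = sum(k ** b for k in range(1, n + 1))  # T_n, observations after n blocks
    per_update = tau_n ** 0.5                     # rate as a function of n updates
    per_observation = total ** (b / (2 * (b + 1)))  # rate as a function of T_n
    return per_update / per_observation


ratio_1000 = polynomial_row_check(1000)
ratio_2000 = polynomial_row_check(2000)
```

The ratio stabilizes near $(b+1)^{b/(2(b+1))}$, which is the multiplicative constant left out of the table.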

6 Proofs 

For $p > 0$ and $Z$ a random variable measurable w.r.t. the $\sigma$-algebra $\sigma(Y_n, n \in \mathbb{Z})$,
set $\|Z\|_{\star,p} = \left(\mathbb{E}_\star[|Z|^p]\right)^{1/p}$.

6.1 Proof of Theorem 4.1

The proof of Theorem 4.1 relies on auxiliary results about the forgetting properties
of HMM. Most of them are very close to published results and their proofs
are provided in the supplementary material [21, Section 4]. The main novelty is
the forgetting property of the bivariate smoothing distribution.

Lemma 6.1. Assume ^"^j^l^ Let y e s.t. sup^, \S{x,x' ,yi)\ < +00 for any 
i eIj. Then for any r > and any distribution x on (X, B(X)), 9 1— >■ '^^'q'^^S, y) 
is continuous on O. 

Proof. Set Ke{x,x' ,y) mg{x,x')ge{x' ,y). Let r > and x be a distribution 
on (X,B(X)). By definition of ^^'^'^(S', y) (see (jlj) we have to prove that 



is continuous for h{x,x',y) = 1 and h{x,x' ,y) ~ S{x,x',y). By A[I|[a|), the 
function 9 Tll^-r ^e{xi, Xi+i, j/i+i) h(x^i, xo, yo) is continuous. In addition, 
under for any 9 ^ Q, 



r-l 



Kg{xi,Xi+i,yi+i)h{x-i,xo,yo) 

l——r 

^ \h{x-i,xo,yo)\exp ^r4>{9) + (^tIj{9), ^ S'(xj, x^+i, j/^+i)^^ . 

Let /C be a compact subset of 8. By A[T] there exist constants Ci and C2 s.t. 
the supremum in G of this expression is bounded above by 

CiSup|/i(a;,a;',2/o)|exp C2 ^ sup\S{x,x,' ,yi+i)\ . 

x.x' \ x,x' j 

\ i— — r ' / 
Since $\chi$ is a distribution and $\lambda$ is a finite measure, the continuity follows from
the dominated convergence theorem. □



Let us introduce the following shorthand: $S_s(x,x') = S(x,x',Y_s)$. For a
function $h$, define $\mathrm{osc}(h) = \sup_{z,z'} |h(z) - h(z')|$. Note that under A3-(1),
$\mathbb{E}_\star[\mathrm{osc}(S_0)] < +\infty$. Under the assumptions of Theorem 4.1, [21, Proposition 4.3(ii)] implies that for any
$\theta \in \Theta$, there exists a r.v. $\Phi_\theta(S, \mathbf{Y})$ s.t. for any $r < s \le T$,



sup 



(18) 



This concludes the proof of (12). For the proof of (14), we introduce the following
decomposition: for all $T > 0$,

S^'^ie, Y)^lj2 ['^0 (S, ^*+^Y) + {$^;°, {S, ^^Y) - {S, #+^Y) } 



upon noting that by ^, S^''^{9,Y) ^ r'^ E Li ff t-r {SJ'^Y). By ^, (isj 
and A[3]-(1) (S*, Y)|] < +00 . Under A4 ^l', the ergodic theorem (see 

e.g. [DTheorem 24.1, p.314]) states that 

1 ^ 

lim -V $g (5,i9*+^Y) =EJ$e(5,Y)] - a.s 



for any fixed T. By (18 1, 



\jl\^flr (^,^^Y)-$,(5,^*+^Y)| < 1^(p-*+p*-i)osc(5,+t) 
t=l t=l 

Set Zt '= J J2\=i osc(S's+t) and Zq 0. Then, by an Abel transform, 
- Vp*-1osc(5,+t) = P^-'Z, + ^ V V^^Z, . 



(19) 



(20) 



t=i 



t=i 



Under A4|[^|bl and A|3]-(1), the ergodic theorem implies that MuIt-^qo Z^ = 
E^ [osc(S'o)] P* — a.s. Therefore, limsup^ Zr < 00 Pi, ~ a.s. Since J2t>i ^P*^^ < 
00, this impHes that St'=i P*~^osc{St+T) — > Pj, — a.s. Similarly, 

r — ^+00 

r 1 r— 1 

- ^ p^-*0Sc(5t+T) =Zr-{l-p)J2 P^~'^'Zt + ^ ^ tp'-^Z,_t . 
t=l t=l t=l 

We have limT-_j.oo YlIZi ip^~^'Z'r-t = 0, P* — a.s by using the same arguments 
as for the second term in (20 1. Furthermore, 

"^"1 T— f — 1 r — 1 1 

P I c< \^ 



+ E^ [osc(S'o)] p^^i 
Since $Z_\tau \to \mathbb{E}_\star[\mathrm{osc}(S_0)]$ $\mathbb{P}_\star$-a.s., the RHS converges $\mathbb{P}_\star$-a.s. to 0 and



lim 

r— >-+oo 



Hence, the RHS in (19) converges $\mathbb{P}_\star$-a.s. to 0 and this concludes the proof of
(14). We now prove that the function $\theta \mapsto \mathbb{E}_\star[\Phi_\theta(S, \mathbf{Y})]$ is continuous by application
of the dominated convergence theorem. From [21, Proposition 4.3(ii)],
for any $y$ s.t. $\mathrm{osc}(S(\cdot,\cdot,y_0)) < \infty$,



Hm sup 



= 



Then, by Lemma 6.1, $\theta \mapsto \Phi_\theta(S, y)$ is continuous for any $y$ such that $\mathrm{osc}(S(\cdot,\cdot,y_0)) <
+\infty$. In addition, $\sup_{\theta \in \Theta} |\Phi_\theta(S, \mathbf{Y})| \le \sup_{x,x'} |S(x,x',Y_0)|$. We then conclude
by A3-(1).



6.2 Proof of Proposition 4.2

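Set

$\ell^{\chi,0}_{\theta,T}(\mathbf{Y}) = \log\left( \int \chi(dx_0) \prod_{t=1}^{T} \{ m_\theta(x_{t-1},x_t)\, g_\theta(x_t,Y_t) \}\, \lambda(dx_1) \cdots \lambda(dx_T) \right)$ .

(Continuity of R and W) By A1(c) and Theorem 4.1, the function $R$ is continuous.
Under A1-2 and A4, there exists a continuous function $\ell_\star$ on $\Theta$ s.t.
$\lim_T T^{-1} \ell^{\chi,0}_{\theta,T}(\mathbf{Y}) = \ell_\star(\theta)$ $\mathbb{P}_\star$-a.s. for any distribution $\chi$ on $(\mathbb{X}, \mathcal{B}(\mathbb{X}))$ and any
$\theta \in \Theta$ (see [14, Lemma 2 and Proposition 1], see also [21, Theorem 4.9]).
Therefore, $W$ is continuous.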
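For intuition, the log-likelihood of the observations defined above can be evaluated exactly when the state space is finite. The sketch below assumes, purely for illustration, a finite state space with $\lambda$ the counting measure and a finite observation alphabet; `chi`, `m` and `g` are the initial distribution, transition matrix and emission matrix, and a brute-force sum over all paths is included as a cross-check. None of these names come from the paper.

```python
import math
from itertools import product


def log_likelihood(chi, m, g, ys):
    """log sum_{x_0..x_T} chi[x_0] * prod_t m[x_{t-1}][x_t] * g[x_t][y_t]."""
    k = len(chi)
    alpha = list(chi)  # normalized forward weights, starting from chi
    loglik = 0.0
    for y in ys:
        alpha = [sum(alpha[i] * m[i][j] for i in range(k)) * g[j][y] for j in range(k)]
        norm = sum(alpha)
        loglik += math.log(norm)  # accumulate the log of each normalizing constant
        alpha = [a / norm for a in alpha]
    return loglik


def brute_force_log_likelihood(chi, m, g, ys):
    """Same quantity by explicit enumeration of all hidden paths (small T only)."""
    k, total = len(chi), 0.0
    for xs in product(range(k), repeat=len(ys) + 1):
        weight = chi[xs[0]]
        for t, y in enumerate(ys, start=1):
            weight *= m[xs[t - 1]][xs[t]] * g[xs[t]][y]
        total += weight
    return math.log(total)


chi = [0.6, 0.4]
m = [[0.7, 0.3], [0.2, 0.8]]
g = [[0.9, 0.1], [0.3, 0.7]]
ys = [0, 1, 1, 0]
normalized = log_likelihood(chi, m, g, ys) / len(ys)
```

The quantity `normalized` is the finite-$T$ normalized log-likelihood whose almost sure limit defines $\ln W(\theta)$.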

Proof of Proposition 4.2(i). For all $T > 0$ and all $\theta \in \Theta$, define

$p_\theta(x_{0:T}, y_{1:T}) = \prod_{i=1}^{T} m_\theta(x_{i-1}, x_i)\, g_\theta(x_i, y_i)$ . (21)



Under Assumption ^^[l]0 
1 



T 



\ogpe{xo:T,Yi.,T) 



Upon noting that 
/ S{xt-^i,xt,Yt) 



P9ixO:T,Yi.,T) 



J pe(^o:T,yi:T)A(dzi:T)x(dzo) 



A(dxi.T)x(dxo) = $^.f.j,(5,Y) , 
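a classical use of the Jensen inequality gives, $\mathbb{P}_\star$-a.s.,

$\frac{1}{T} \ell^{\chi,0}_{R(\theta),T}(\mathbf{Y}) - \frac{1}{T} \ell^{\chi,0}_{\theta,T}(\mathbf{Y}) \ge \phi(R(\theta)) + \left\langle \frac{1}{T} \sum_{t=1}^{T} \Phi^{\chi,0}_{\theta,t,T}(S,\mathbf{Y}), \psi(R(\theta)) \right\rangle - \phi(\theta) - \left\langle \frac{1}{T} \sum_{t=1}^{T} \Phi^{\chi,0}_{\theta,t,T}(S,\mathbf{Y}), \psi(\theta) \right\rangle$ . (22)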



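Under A1-4, it holds by Theorem 4.1 and [14, Lemma 2 and Proposition 1] (see
also [21, Theorem 4.9(ii)]) that for all $\theta \in \Theta$, $\mathbb{P}_\star$-a.s.,

$\frac{1}{T} \sum_{t=1}^{T} \Phi^{\chi,0}_{\theta,t,T}(S,\mathbf{Y}) \to \bar S(\theta)$ , $\frac{1}{T} \ell^{\chi,0}_{\theta,T}(\mathbf{Y}) \to \ln W(\theta)$ , as $T \to +\infty$.

Therefore, when $T \to +\infty$, (22) implies

$\ln\left(W(R(\theta))/W(\theta)\right) \ge \phi(R(\theta)) + \langle \bar S(\theta), \psi(R(\theta)) \rangle - \phi(\theta) - \langle \bar S(\theta), \psi(\theta) \rangle$ . (23)

By definition of $\bar\theta$ and $R$ (see (9)), the RHS is non-negative. This
concludes the proof of Proposition 4.2(i).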

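Proof of Proposition 4.2(ii). We prove that $W \circ R(\theta) - W(\theta) = 0$ if and only if
$\theta \in \mathcal{L}$. Since $W \circ R - W$ is continuous, this implies that $\inf_{\theta \in \mathcal{K}} \{W \circ R(\theta) - W(\theta)\} > 0$
for all compact set $\mathcal{K} \subset \Theta \setminus \mathcal{L}$. Let $\theta \in \Theta$ be s.t. $W \circ R(\theta) - W(\theta) = 0$. Then,
the RHS in (23) is equal to zero. By definition of $\bar\theta$, $R(\theta) = \theta$ and thus $\theta \in \mathcal{L}$.
The converse implication is immediate from the definition of $\mathcal{L}$.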



6.3 Proof of Theorem 4.4

The proof of Theorem 4.4 follows the same lines as the proof of [17, Theorem
3]. The key ingredient for this proof is the control of the $L_p$-mean error between
the Block Online EM algorithm and the limiting EM. This is the crucial
difference with [17]. The proof of this bound is derived in Proposition 6.5 and
relies on preliminary lemmas; the detailed proof of Theorem 4.4 is given in [21,
Section 3.1].

In the sequel, for all function S on 9 x and all 0^, e 9, we denote by 
[5(6', Y)]g^g^ the function 6* i-^ [5(6*, Y)] evaluated at 6* = 9^. Finally, for 
any i > 1, m > 1 and any distribution x on (X,S(X)), define 



(0, Y) mjUS,Y) E, ['^l-,Z{S,Y)l 



(24) 



Lemma 6.2. Assume ^ ^(p), and ^f or some p > 2. Letp e {2,p). 

There exists a constant C s.t. for any distribution x on (X,S(X)), any m > 1, 



k,i>Q and any Q-valued J-q 



-measurable r.v 



0, 



.X 

'2um+t,7n 



< C 



where An 

^ pp 



and [3 is given by 
Proof. For ease of notation x is dropped from the notation K2um m-
Berbee Lemma (see f27l Chapter 5]), for any m > 1, there exists a 0- valued r.v. 
e* on (17, J",P*) independent from (see Eq.([lO|) s.t. 



\{e^e*}^ sup \v4B\a{e))-v^{B)\ 
seer 



(25) 



dcf , 



Set Lu = 2um + I. We write 



ii — 1 u — 1 

^ «:i„,„.(r, Y) + fc {e, [$J;-™(5, Y)]_^. - [<i>^-:^(5, Y)]_J 



(26) 



By the Holder's inequality with a =^ p/p and 5 ^ 1 — a ^, 



< 



X-L-m 
e,LX+m 



{S, i9^Y) - $ 



X,L-m 
e* ,L,L+m 



{S,Y) ¥,{9^9*} 



Ap 



By i5i[4j[a|), i^(p), ([25]) and ([TT]), there exists a constant Ci s.t. for any 

m,L >1, any distribution x ^^nd any 0-valued J^p'^-measurable r.v. 6, 



X,L-m 



Similarly, there exists a constant C2 s.t. for any m > 1, any distribution % and 
any 0-valued J^p'^-measurable r.v. 9, 

E. [<i>^;o:™(^, Y)]_^, - E, [<i>^:o;;':(^, Y)]_^ < c^r^^ . 



Let us consider the second term in (26). For any u> 1 and any u e 0, the r.v. 
KL„.m('L', Y) is a measurable function of Y^ for all — m + 1 < i < + m. 
Since L„ > 2um, for any u € 0, X]t=i Y) is t/^-measurable. 9* is 
independent from so that: 



^Ati„,„(r,Y) 









fc 


P- 




E, 


E, 




KL„,m(u, Y) 

u=l 




v=0*- 



l/p 



Define the strong mixing coefficient (see [7]) 

aY(r-) li'^sup sup \V^{AC] B) ^V^{A)V^{B)\ ,r >Q . 
Then, [7, Theorem 14.1, p. 210] implies that for any $m \ge 1$, the strong mixing
coefficients $\alpha^{(v,m)}$ of the sequence $\{\kappa_{L_u,m}(v,\mathbf{Y})\}_{u \ge 1}$ satisfy $\alpha^{(v,m)}(i) \le
\alpha^{Y}(2(i-1)m + 1)$. Furthermore, by [27, Theorem 2.5],



^ KL„,m(v, Y) 



< 



i/p 



where iV(,„)(i) X]i>i Iq^c™) (i)>t ^^^^ Qtj,™ denotes the inverse of the tail 
function t P^,(|kl^.,„(u, Y)| > t). The sequence Y being stationary, this 
inverse function does not depend on u. By i^and the inequality (r) < fP^ {r) 
(see e.g. [TJ Chapter 13]), there exist /3 e [0,1) and C G (0,1) s.t. for any 
u,m > 1, 



i>l 



2m log p 



Let ?7 be a uniform r.v. on [0, 1]. Observe that < 1. Then, by the Holder 

inequality applied with a p/p and =^ 1 — a~^, 

i/p 



dcf 



[iV(„,)(C/) A k] Q.,™([/) / [iV(„,)(^.) A fc]^/' Ql^^{u)du 



(C/)l 



< <^ (C/32™'=)^P/ci/2 + 



-1 


1/2 


_2m log/3_ 





Q.,m(C/) -log 



1/2 



-1 


1/2 




2 m log [3 







|Q..m(C^)llp • 



pb 



Since i7 is uniform on [0,1], Qv,m{U) and |Ki^^_,„(z;, Y)| have the same distri- 
bution, see [27]. Then, by |21l Lemma 4.5] and i^(p), there exists a constant 
C s.t. for any v G Q, any m > 1, 



sup \\Q 

v.m 



sup |5'(a;,x', Yo) 



which concludes the proof. 



□ 



Lemma 6.3. Assume ^(p), and for some p > 2. Letp£ (2,p). 

There exists a constant C s.t. for any n > 1, any 1 < rrin < t„+i and any 
distribution x on (X, S(X)), 



2Vnmn 



< c 



where k¥ and /3 are defined by (24 1 and A5 



dcf 



2m„ 



and Ap=^ 
Proof. We write,



2m„-l 

^ E 

f=0 



Observe that by definition 0„ g . Then, by Lemma 
constant C s.t. for any nin > 1 and any £ > 0, 



6.2 



there exists a 



< c 



The proof is concluded upon noting that $\tau_{n+1} \ge 2 m_n v_n$. □



Lemma 6.4. Assume j^(p) and for somep>2. For any p G {2,p], 

there exists a constant C s.t. for any n> 1, any 1 < m„ < qn Tn+i 0,^1(1 any 
distribution x on (X, S(X)), 



< C 



P 



w/iere p„ =^ J2t=2ni„ '^t,mj^n,'&^"^) and is de/ined &?/ ([24l 

Proof By Q and ([sl, 5^;^^ (0„, Y) - S(0„) - p„ = ^ ti 9^,n where 

-r„+i 

9ir) —1 

dof 1 



53,: 



dof 



1 

Tn+1 



t=l 

r„+i 



In the case r„4-i > 2m„, it holds 

T„ + l 



T„+l|5l,„|< ^ 



where we used |21l Proposition 4.3(i) and Remark 4.4] in the last inequality. 
By J^ip) and ^140, there exists C s.t. < C (p"" + T.nli) In the 






case Tn+i < 2m„, it can be proved along the same lines that |l5i,n|l^p ^ 
C ('p'^'i+i-™" + T^^j). For g2,,i and 53, „, we use the bounds 



< sup \S{x,x',YT„+t) 



sup \S{x,x',Yo) 

{x,x')ex^ 



Then, by AHE 



< 2 



sup |S'(a;,a;',Fo)| 

(a;,a;')GX2 



and the RHS is finite under A|3]-(p). Finally, 

|54.„| < 2p""-iE, [osc{5(-,-,:^o)}] 
where we used Theorem 4.1. This concludes the proof.



□ 

Proposition 6.5. Assume ^(P), ^TjEI '^"'^ 41 some p > 2. Let 
p € (2,p). There exists a constant C s.t. for any n > I and any distribution x 
on (X,B(X)), 

%^!^(0„,Y)-S(0„) <^^. 

Proof. Let m„, ?;„ be positive integers s.t. 1 < m„ < t„^i and t„+i = 2u„to„ + 
Tn, where < r„ < 2m„. Set Ap 1/p — 1/p. By the Minkowski inequal 
ity combined with Lemmas 



6.3 



6.4 



dcf 



constant C s.t. 

%^^(0„,Y)-S(0„) 



< C 



applied with g„ = 2w„m„, there exists a 



The proof is concluded by choosing m„ = [— logT„+i/ (logp V Aplog/3)J . □ 
6.4 Proofs of Section 5



6.4.1 Proof of Theorem 5.1, Eq. (16) 



Define Si, S(0*) and 



5*0 %° (00 , Y) and 5„ (0„ , Y) , Vn > . (27) 
We have:



where p„ is defined by ^ and {0„}„>o is defined by By Theorem 4.4 a|), 
the number of truncations is a.s. finite so that the second term is Oa.s(l). We 
write, for the first term, 



(28) 



dcf 



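where $\Upsilon = \nabla\bar\theta(s_\star)$. We now derive the rate of convergence of the quantity
$S_n - s_\star$. Set $G(s) = \bar S \circ \bar\theta(s)$. Note that under A7(b), $\mathrm{sp}(\Gamma) \le \gamma$, where
$\Gamma = \nabla G(s_\star)$, since $\Gamma = \nabla\bar S(\theta_\star) \cdot \nabla\bar\theta(s_\star)$. Since $G(s_\star) = s_\star$, we write

$S_n - s_\star = \Gamma (S_{n-1} - s_\star) + S_n - G(S_{n-1}) + G(S_{n-1}) - G(s_\star) - \Gamma (S_{n-1} - s_\star)$ .

Define $\{\mu_n\}_{n \ge 0}$ and $\{\rho_n\}_{n \ge 0}$ s.t. $\mu_0 = 0$, $\rho_0 = S_0 - s_\star$ and

$\mu_n = \Gamma \mu_{n-1} + e_n$ , $\rho_n = S_n - s_\star - \mu_n$ , $n \ge 1$ , (29)

where

$e_n = S_n - \bar S(\theta_n)$ , $n \ge 1$ . (30)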
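The linear recursion in (29) explains why the leading term of the error scales like $\tau_n^{-1/2}$. The following sketch is a deterministic caricature, not the stochastic setting of the paper: it assumes a scalar contraction factor `gamma` in place of the matrix $\Gamma$ and takes errors of exact size $\tau_n^{-1/2}$ with $\tau_n = n^2$, whereas in the paper $e_n$ is random and only its $L_p$-norm is controlled. All names are hypothetical.

```python
def mu_path(gamma, errors):
    """Iterate mu_n = gamma * mu_{n-1} + e_n starting from mu_0 = 0."""
    mu, path = 0.0, []
    for e in errors:
        mu = gamma * mu + e
        path.append(mu)
    return path


n_max = 2000
gamma = 0.5                                       # assumed scalar stand-in for Gamma
tau = [n ** 2 for n in range(1, n_max + 1)]       # polynomial block sizes tau_n = n**2
errors = [t ** -0.5 for t in tau]                 # errors of size tau_n**(-1/2)
mu = mu_path(gamma, errors)
scaled = [t ** 0.5 * value for t, value in zip(tau, mu)]  # sqrt(tau_n) * mu_n
```

In this toy case $\sqrt{\tau_n}\,\mu_n$ stays bounded and approaches $1/(1-\gamma)$, which mirrors the $O_{L_p}(1)$ control of $\sqrt{\tau_n}\,\mu_n$ stated next.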

Proposition 6.6. Assume A1-2, A3-($p_2$), A4-5, A6-($p_1$), A7 and A8(a) for some
$2 < p_1 < p_2$. Then for any $p \in (2, p_2)$, $\sqrt{\tau_n}\,\mu_n = O_{L_p}(1)$ and $\tau_n \rho_n \mathbf{1}_{\lim_n S_n = s_\star} =
O_{L_{p/2}}(1)\, O_{a.s}(1)$.



The proof of Proposition 6.6 follows the same lines as the proof of [17, Theorem
6]. The main ingredient is the control of $\|\mu_n\|_{\star,p}$, which is a consequence of [25,
Result 178, p. 39] and Proposition 6.5. The detailed proof is thus omitted and
postponed to the supplementary material [21, Section 3.2].



By Proposition 6.6 the first term in (28) gives 



-oAg(s.)ee,„ = 0L.(l) + -=OL,/,(l)a.s (1) 



A Taylor expansion with integral remainder term gives the rate of convergence 
of the second term. 



6.4.2 Proof of Theorem 5.1, Eq. (17)



We preface the proof by the following lemma. 

Lemma 6.7. Assume ^(P2), ^4|7][5| ^ for some p2 > 2. For any 

P& (2,P2), 

limsup ^=^= ^Tfc+iefc < 00 , 
fc=i 



where e„ is given by (30 I 
Proof. Let $p \in (2, p_2)$. In the sequel, $C$ is a constant independent of $n$
whose value may change upon each appearance. Let $1 \le m_n \le \tau_{n+1}$ and set



def 



2m„ 



. By Lemma 



6.4 



dcf 



applied with qk — 2vkmk, we have, 



k=l 



, fc=i 



k=l 



where 6k and Cfc are defined by 

4=' E {^t,fe(^fe,Y)-Eji^,,fe(0fe,Y)|j-^J} , 

£— 2mfc 
2vkrnk r 

t=2mfc ^ 

where Ft.k{Ok,Y) (5, Y) and 7"^ is given by (|lo|). We will 

prove below that there exists C s.t. 



iiaiL,p<^^/3"'=/^'T,+i , vfc>i 



(31) 



fc=i 



< CjT^i + C ^ , Vn > 1 (32) 



k=l 



dcf 



SO that the proof is concluded by choosing = [r/ log Tk+i\ , i] — ( — 1/ log p) V 
(-pfe/log/3)- 



We turn to the proof of (311. By the Berbee Lemma (see |27j Chapter 5]) 
and there exist C € [0, 1) and /? e (0, 1) s.t. for all fc > 1, there exists a 
random variable i^^^^^^.jn^^^^„j^ on (fJ, J^, P^,) independent from with the 
same distribution as YT^^mk-.Tt+i+irn. and 

IP* {^Tfc+mfc:Tfc + i+mfc ^ ^Tfe+mfc :Tfc + i +mfc } < C/S'"" . (33) 

Upon noting that [^^(,,-(0,-, Y*'W) | J-^J = E, [^^t,fc(e, Y)] , we have 



Ck^ {EjJ^t,fe(^fe,Y)|j-Y] _E, 



Pt,k{(^k,Y 



} ■ (34) 



Therefore, by setting A =^ {^^Z+mkiT^+i+m, 7^ Yn+m^-.n+.+mJ, 



lai < E 

t=2mfc 



sup 

flee 



^^t,fc((?,Y)-Ft.fc(0,Y*'W; 
The Minkowski and Hölder inequalities (the latter with $a = p_2/p$ and $b^{-1} = 1 - a^{-1}$),
combined with (33), [21, Lemma 4.5] and A3-($p_2$), yield (31).

We now prove (32). Upon noting that $\{\delta_k\}_{k \ge 1}$ is a sequence of martingale
increments w.r.t. the filtration given by (10), the Rosenthal inequality (see [18, Theorem 2.12, p. 23])


States that E^^, 4IL,p < C f ^Li 4'^ ) ' ' + where 



r(2) 



/W'^=i^E,[|<5,n and I^^^) ''^ 



( n 



1/2 



Using again [F^,fc(^^fc, Y^'C^)) 1 7"^] = [Ft,k{e ,Y)]^^^^ and (341 

2'Ufcmt 



^ {Fi,fc(0fc,Y)-EjF,,,(0,Y)],^,J 



t=2mk 



By Lemma 6.3 and (31 1, there exists C s.t. for any fc > 1 



(35) 



and since 2/p < 1, convex inequalities yield (X]fc=i^fc ) — 



ri+l ^ 

C^^^j^ Tk+iP'^''^^^ ■ By the Minkowski and Jensen inequalities, it holds I^'^ < 

Hence, by {ib), < CyiWT + C ^Li ^fc+i/?"''/^'- 

□ 



This concludes the proof of (32).
We write $\Sigma_n - s_\star = \bar\mu_n + \bar\rho_n$ with

n n 
_ dcf 1 , _ dot 1 

= TfT and p„ = — 2_^TkPk~l ■ 

^" fe=l ^" fe=l 

Proposition 6.8. Assume ^ ^(p2), 431 4^(Pi)' ^and J^fo 
2 <pi < p2. For any p e (2,^2), 

= Ol,(i) , — p„iiim„s„=.. =Or ,,(i)a.s(i) . 

Proof. Set A =^ (/ - qT). Under i^j?] exists. By ^ and (|36]), 



(36) 



Tn+lPv 



T ^/T 



_^ n 1 ^ 



fc=i 



fc=l 



Tfc+1 



fc-i 



The result now follows from Proposition 6^ Lemma 6.7 and The proof of 
the second assertion follows from (36 1 and Proposition 6.6 □ 






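Upon noting that $\theta_\star = \bar\theta(s_\star)$, we may write, for the averaged sequence,
$\tilde\theta_n - \theta_\star = \Upsilon(\Sigma_n - s_\star) + \bar\theta(\Sigma_n) - \bar\theta(s_\star) - \Upsilon(\Sigma_n - s_\star)$ , with $\Upsilon = \nabla\bar\theta(s_\star)$.
The first term in this decomposition gives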



/ Ti 

' -^71 



By A7(b), as for the non-averaged sequence, a Taylor expansion with integral
remainder term gives the result for the second term.